{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python3",
   "display_name": "DataAnalysis"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk\n",
    "import re"
   ]
  },
  {
   "source": [
    "# Ch07 从文本提取信息\n",
    "\n",
    "学习目标\n",
    "\n",
    "1.  从非结构化文本中提取结构化数据\n",
    "2.  识别一个文本中描述的实体和关系\n",
    "3.  使用语料库来训练和评估模型"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## 7.1 信息提取\n",
    "\n",
    "从文本中获取意义的方法被称为「信息提取」\n",
    "\n",
    "1.  从结构化数据中提取信息\n",
    "2.  从非结构化文本中提取信息\n",
    "    -   建立一个非常一般的含义\n",
    "    -   查找文本中具体的各种信息\n",
    "        -   将非结构化数据转换成结构化数据\n",
    "        -   使用查询工具从文本中提取信息"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### 7.1.1 信息提取结构\n",
    "｛原始文本（一串）｝→ 断句 \n",
    "\n",
    "→｛句子（字符串列表）｝→ 分词 \n",
    "\n",
    "→｛句子分词｝→ 词性标注 \n",
    "\n",
    "→｛句子词性标注｝→ 实体识别 \n",
    "\n",
    "→｛句子分块｝→ 关系识别 \n",
    "\n",
    "→｛关系列表｝"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "['BBDO South', 'Georgia-Pacific']\n"
     ]
    }
   ],
   "source": [
    "# P283 图7-1，Ex7-1：信息提取结构元组（entity, relation, entity)\n",
    "# 建立标准化数据库\n",
    "locs = [('Omnicom', 'IN', 'New York'),\n",
    "        ('DDB Needham', 'IN', 'New York'),\n",
    "        ('Kaplan Thaler Group', 'IN', 'New York'),\n",
    "        ('BBDO South', 'IN', 'Atlanta'),\n",
    "        ('Georgia-Pacific', 'IN', 'Atlanta')]\n",
    "# 依据查询提取信息\n",
    "query = [\n",
    "        e1\n",
    "        for (e1, rel, e2) in locs\n",
    "        if e2 == 'Atlanta'\n",
    "]\n",
    "print(query)"
   ]
  },
  {
   "source": [
    "## 7.2 分块：用于实体识别的基本技术（P284 图7-2）\n",
    "\n",
    "分块构成的源文本中的片段不能重叠\n",
    "\n",
    "-   小框显示词级标识符和词性标注\n",
    "-   大框表示组块（chunk），是较高级别的程序分块\n",
    "\n",
    "分块的方法\n",
    "\n",
    "-  正则表达式和N-gram方法分块；\n",
    "-  使用CoNLL-2000分块语料库开发和评估分块器；"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### 7.2.1 名词短语分块（NP-chunking，即“NP-分块”）寻找单独名词短语对应的块\n",
    "\n",
    "NP-分块是比完整的名词短语更小的片段，不包含其他的NP-分块，修饰一个任何介词短语或者从句将不包括在相应的NP-分块内。\n",
    " \n",
    "NP-分块信息最有用的来源之一是词性标记。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP the/DT little/JJ yellow/JJ dog/NN)\n  barked/VBD\n  at/IN\n  (NP the/DT cat/NN))\n"
     ]
    }
   ],
   "source": [
    "# P285 Ex7-1 基于正则表达式的NP 分块器\n",
    "# 使用分析器对句子进行分块\n",
    "sentence = [('the', 'DT'),\n",
    "            ('little', 'JJ'),\n",
    "            ('yellow', 'JJ'),\n",
    "            ('dog', 'NN'),\n",
    "            ('barked', 'VBD'),\n",
    "            ('at', 'IN'),\n",
    "            ('the', 'DT'),\n",
    "            ('cat', 'NN')]\n",
    "# 定义分块语法\n",
    "grammar = 'NP: {<DT>?<JJ>*<NN>}'\n",
    "# 创建组块分析器\n",
    "cp = nltk.RegexpParser(grammar)\n",
    "# 对句子进行分块\n",
    "result = cp.parse(sentence)\n",
    "# 输出分块的树状图\n",
    "print(result)\n",
    "result.draw()"
   ]
  },
  {
   "source": [
    "### 7.2.2. 标记模式"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP another/DT sharp/JJ dive/NN trade/NN figures/NNS)\n  (NP any/DT new/JJ policy/NN measures/NNS)\n  (NP earlier/JJR stages/NNS)\n  (NP Panamanian/JJ dictator/NN Manuel/NNP Noriega/NNP))\n"
     ]
    }
   ],
   "source": [
    "# 华尔街日报\n",
    "sentence = [('another', 'DT'),\n",
    "            ('sharp', 'JJ'),\n",
    "            ('dive', 'NN'),\n",
    "            ('trade', 'NN'),\n",
    "            ('figures', 'NNS'),\n",
    "            ('any', 'DT'),\n",
    "            ('new', 'JJ'),\n",
    "            ('policy', 'NN'),\n",
    "            ('measures', 'NNS'),\n",
    "            ('earlier', 'JJR'),\n",
    "            ('stages', 'NNS'),\n",
    "            ('Panamanian', 'JJ'),\n",
    "            ('dictator', 'NN'),\n",
    "            ('Manuel', 'NNP'),\n",
    "            ('Noriega', 'NNP')]\n",
    "grammar = 'NP: {<DT>?<JJ.*>*<NN.*>+}'\n",
    "cp = nltk.RegexpParser(grammar)\n",
    "result = cp.parse(sentence)\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Grammar中输入语法，语法格式{<DT>?<JJ.*>*<NN.*>+}，不能在前面加NP:，具体可以参考右边的Regexps说明\n",
    "# Development Set就是开发测试集，用于调试语法规则。绿色表示正确匹配，红色表示没有正确匹配。黄金标准标注为下划线\n",
    "nltk.app.chunkparser()"
   ]
  },
  {
   "source": [
    "### 7.2.3 用正则表达式分块（组块分析）"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP Rapunzel/NNP)\n  let/VBD\n  down/RP\n  (NP her/PP$ long/JJ golden/JJ hair/NN))\n"
     ]
    }
   ],
   "source": [
    "# Ex7-2 简单的名词短语分类器\n",
    "sentence = [(\"Rapunzel\", \"NNP\"),\n",
    "            (\"let\", \"VBD\"),\n",
    "            (\"down\", \"RP\"),\n",
    "            (\"her\", \"PP$\"),\n",
    "            (\"long\", \"JJ\"),\n",
    "            (\"golden\", \"JJ\"),\n",
    "            (\"hair\", \"NN\")]\n",
    "# 两个规则组成的组块分析语法，注意规则执行会有先后顺序，两个规则如果有重叠部分，以先执行的为准\n",
    "grammar = r'''\n",
    "  NP: {<DT|PP\\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun\n",
    "      {<NNP>+}                # chunk sequences of proper nouns\n",
    "'''\n",
    "print(nltk.RegexpParser(grammar).parse(sentence))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  Rapunzel/NNP\n  let/VBD\n  down/RP\n  her/PP$\n  (NP long/JJ golden/JJ)\n  hair/NN)\n"
     ]
    }
   ],
   "source": [
    "grammar = r'NP: {<[CDJ].*>+}'\n",
    "print(nltk.RegexpParser(grammar).parse(sentence))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP Rapunzel/NNP)\n  let/VBD\n  down/RP\n  (NP her/PP$ long/JJ golden/JJ hair/NN))\n"
     ]
    }
   ],
   "source": [
    "grammar = r'NP: {<[CDJNP].*>+}'\n",
    "print(nltk.RegexpParser(grammar).parse(sentence))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP Rapunzel/NNP)\n  let/VBD\n  down/RP\n  her/PP$\n  (NP long/JJ golden/JJ hair/NN))\n"
     ]
    }
   ],
   "source": [
    "grammar = r'NP: {<[CDJN].*>+}'\n",
    "print(nltk.RegexpParser(grammar).parse(sentence))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "错误分块的结果=  (S (NP money/NN market/NN) fund/NN)\n正确分块的结果=  (S (NP money/NN market/NN fund/NN))\n"
     ]
    }
   ],
   "source": [
    "# 如果模式匹配位置重叠，最左边的优先匹配。\n",
    "# 例如：如果将匹配两个连贯名字的文本的规则应用到包含3个连贯名词的文本中，则只有前两个名词被分块\n",
    "nouns = [('money', 'NN'), ('market', 'NN'), ('fund', 'NN')]\n",
    "grammar = 'NP: {<NN><NN>}'\n",
    "print(\"错误分块的结果= \", nltk.RegexpParser(grammar).parse(nouns))\n",
    "grammar = 'NP: {<NN>+}'\n",
    "print(\"正确分块的结果= \", nltk.RegexpParser(grammar).parse(nouns))"
   ]
  },
  {
   "source": [
    "### 7.2.4 探索文本语料库：从已经标注的语料库中提取匹配特定词性标记序列的短语"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(CHUNK combined/VBN to/TO achieve/VB)\n(CHUNK continue/VB to/TO place/VB)\n(CHUNK serve/VB to/TO protect/VB)\n(CHUNK wanted/VBD to/TO wait/VB)\n(CHUNK allowed/VBN to/TO place/VB)\n(CHUNK expected/VBN to/TO become/VB)\n(CHUNK expected/VBN to/TO approve/VB)\n(CHUNK expected/VBN to/TO make/VB)\n(CHUNK intends/VBZ to/TO make/VB)\n(CHUNK seek/VB to/TO set/VB)\n"
     ]
    }
   ],
   "source": [
    "grammar = 'CHUNK: {<V.*><TO><V.*>}'\n",
    "cp = nltk.RegexpParser(grammar)\n",
    "brown = nltk.corpus.brown\n",
    "count = 0\n",
    "for sent in brown.tagged_sents():\n",
    "    if count < 10:\n",
    "        tree = cp.parse(sent)\n",
    "        for subtree in tree.subtrees():\n",
    "            if subtree.label() == 'CHUNK':\n",
    "                count += 1\n",
    "                print(subtree)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 定义一个搜索函数（一次性返回定义好的数据量）\n",
    "def find_chunks(pattern):\n",
    "    cp = nltk.RegexpParser(pattern)\n",
    "    brown = nltk.corpus.brown\n",
    "    count = 0\n",
    "    for sent in brown.tagged_sents():\n",
    "        if count < 10:\n",
    "            tree = cp.parse(sent)\n",
    "            for subtree in tree.subtrees():\n",
    "                if subtree.label() == 'CHUNK' or subtree.label()=='NOUNS':\n",
    "                    count += 1\n",
    "                    print(subtree)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(CHUNK combined/VBN to/TO achieve/VB)\n(CHUNK continue/VB to/TO place/VB)\n(CHUNK serve/VB to/TO protect/VB)\n(CHUNK wanted/VBD to/TO wait/VB)\n(CHUNK allowed/VBN to/TO place/VB)\n(CHUNK expected/VBN to/TO become/VB)\n(CHUNK expected/VBN to/TO approve/VB)\n(CHUNK expected/VBN to/TO make/VB)\n(CHUNK intends/VBZ to/TO make/VB)\n(CHUNK seek/VB to/TO set/VB)\n"
     ]
    }
   ],
   "source": [
    "grammar = 'CHUNK: {<V.*><TO><V.*>}'\n",
    "find_chunks(grammar)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(NOUNS Court/NN-TL Judge/NN-TL Durwood/NP Pye/NP)\n(NOUNS Mayor-nominate/NN-TL Ivan/NP Allen/NP Jr./NP)\n(NOUNS Georgia's/NP$ automobile/NN title/NN law/NN)\n(NOUNS State/NN-TL Welfare/NN-TL Department's/NN$-TL handling/NN)\n(NOUNS Fulton/NP-TL Tax/NN-TL Commissioner's/NN$-TL Office/NN-TL)\n(NOUNS Mayor/NN-TL William/NP B./NP Hartsfield/NP)\n(NOUNS Mrs./NP J./NP M./NP Cheshire/NP)\n(NOUNS E./NP Pelham/NP Rd./NN-TL Aj/NN)\n(NOUNS\n  State/NN-TL\n  Party/NN-TL\n  Chairman/NN-TL\n  James/NP\n  W./NP\n  Dorsey/NP)\n(NOUNS Texas/NP Sen./NN-TL John/NP Tower/NP)\n"
     ]
    }
   ],
   "source": [
    "grammar = 'NOUNS: {<N.*>{4,}}'\n",
    "find_chunks(grammar)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 定义一个搜索函数（使用生成器）\n",
    "def find_chunks(pattern):\n",
    "    cp = nltk.RegexpParser(pattern)\n",
    "    brown = nltk.corpus.brown\n",
    "    for sent in brown.tagged_sents():\n",
    "        tree = cp.parse(sent)\n",
    "        for subtree in tree.subtrees():\n",
    "            if subtree.label() == 'CHUNK' or subtree.label() == 'NOUNS':\n",
    "                yield subtree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(CHUNK combined/VBN to/TO achieve/VB)\n(CHUNK continue/VB to/TO place/VB)\n(CHUNK serve/VB to/TO protect/VB)\n(CHUNK wanted/VBD to/TO wait/VB)\n(CHUNK allowed/VBN to/TO place/VB)\n(CHUNK expected/VBN to/TO become/VB)\n(CHUNK expected/VBN to/TO approve/VB)\n(CHUNK expected/VBN to/TO make/VB)\n(CHUNK intends/VBZ to/TO make/VB)\n(CHUNK seek/VB to/TO set/VB)\n"
     ]
    }
   ],
   "source": [
    "grammar = 'CHUNK: {<V.*><TO><V.*>}'\n",
    "for i, subtree in enumerate(find_chunks(grammar)):\n",
    "    if i < 10:\n",
    "        print(subtree)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(NOUNS Court/NN-TL Judge/NN-TL Durwood/NP Pye/NP)\n(NOUNS Mayor-nominate/NN-TL Ivan/NP Allen/NP Jr./NP)\n(NOUNS Georgia's/NP$ automobile/NN title/NN law/NN)\n(NOUNS State/NN-TL Welfare/NN-TL Department's/NN$-TL handling/NN)\n(NOUNS Fulton/NP-TL Tax/NN-TL Commissioner's/NN$-TL Office/NN-TL)\n(NOUNS Mayor/NN-TL William/NP B./NP Hartsfield/NP)\n(NOUNS Mrs./NP J./NP M./NP Cheshire/NP)\n(NOUNS E./NP Pelham/NP Rd./NN-TL Aj/NN)\n(NOUNS\n  State/NN-TL\n  Party/NN-TL\n  Chairman/NN-TL\n  James/NP\n  W./NP\n  Dorsey/NP)\n(NOUNS Texas/NP Sen./NN-TL John/NP Tower/NP)\n"
     ]
    }
   ],
   "source": [
    "grammar = 'NOUNS: {<N.*>{4,}}'\n",
    "for i, subtree in enumerate(find_chunks(grammar)):\n",
    "    if i < 10:\n",
    "        print(subtree)"
   ]
  },
  {
   "source": [
    "### 7.2.5. 添加缝隙：寻找需要排除的成分\n",
    "\n",
    "可以为不包括在大块中的标识符序列定义一个缝隙。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "sentence = [(\"the\", \"DT\"),\n",
    "            (\"little\", \"JJ\"),\n",
    "            (\"yellow\", \"JJ\"),\n",
    "            (\"dog\", \"NN\"),\n",
    "            (\"barked\", \"VBD\"),\n",
    "            (\"at\", \"IN\"),\n",
    "            (\"the\", \"DT\"),\n",
    "            (\"cat\", \"NN\")]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP the/DT little/JJ yellow/JJ dog/NN)\n  barked/VBD\n  at/IN\n  (NP the/DT cat/NN))\n"
     ]
    }
   ],
   "source": [
    "# 先分块，再加缝隙，才能得出正确的结果\n",
    "grammar = r'''\n",
    "    NP: \n",
    "        {<.*>+}         # Chunk everything （先对所有数据分块）\n",
    "        }<VBD|IN>+{     # Chink sequences of VBD and IN（对 VBD 或者 IN 加缝隙）\n",
    "'''\n",
    "print(nltk.RegexpParser(grammar).parse(sentence))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP\n    the/DT\n    little/JJ\n    yellow/JJ\n    dog/NN\n    barked/VBD\n    at/IN\n    the/DT\n    cat/NN))\n"
     ]
    }
   ],
   "source": [
    "# 先加缝隙，再分块，就不能得出正确的结果，只会得到一个块，效果与没有使用缝隙是一样的\n",
    "grammar = r'''\n",
    "    NP: \n",
    "        }<VBD|IN>+{     # Chink sequences of VBD and IN\n",
    "        {<.*>+}         # Chunk everything\n",
    "'''\n",
    "print(nltk.RegexpParser(grammar).parse(sentence))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "(S\n  (NP\n    the/DT\n    little/JJ\n    yellow/JJ\n    dog/NN\n    barked/VBD\n    at/IN\n    the/DT\n    cat/NN))\n"
     ]
    }
   ],
   "source": [
    "grammar = r'''\n",
    "    NP: \n",
    "        {<.*>+}         # Chunk everything\n",
    "'''\n",
    "print(nltk.RegexpParser(grammar).parse(sentence))"
   ]
  },
  {
   "source": [
    "### 7.2.6 分块的表示：标记与树状图\n",
    "\n",
    "作为「标注」和「分析」之间的中间状态（Ref：Ch8），块结构可以使用标记或者树状图来表示\n",
    "\n",
    "使用最为广泛的表示是IOB标记：\n",
    "\n",
    "-   I（Inside，内部）；\n",
    "-   O（Outside，外部）；\n",
    "-   B（Begin，开始）。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  }
 ]
}